Contents

This vignette explains how to perform ionomics data analysis including gene network and enrichment analysis by using the modification of R package, ionflow. The modification(ionflow_funcs) was made by Wanchang Lin () and Jacopo Iacovacci().

Data preparation

To explore the process, we’ll use the ionomics data set:

ion_data <- read.table("../test-data/iondata.tsv", header = T, sep = "\t")
dim(ion_data)
#> [1] 9999   16

Ten random data records are shown as:

sample_n(ion_data, 10)
Table 1: Samples of raw data
Knockout Batch_ID Ca Cd Co Cu Fe K Mg Mn Mo Na Ni P S Zn
YBR245C 6 4.17 0.68 0.14 0.71 1.24 1585.72 216.91 0.56 0.36 64.52 0.34 1457.18 106.27 10.48
YBR233W 6 5.60 0.75 0.12 0.84 1.87 1867.51 248.07 0.69 0.46 42.25 0.39 1744.45 128.53 10.27
YCL045C 20 59.45 1.35 0.21 2.05 9.85 2407.76 629.56 1.54 1.98 173.73 1.56 4332.01 883.05 16.98
YCL042W 20 38.42 0.89 0.20 2.59 9.24 2800.92 667.87 1.21 1.42 158.78 1.49 4280.23 463.92 15.23
YEL072W 13 49.04 0.94 0.15 1.46 5.83 3421.35 701.39 1.36 1.38 174.30 1.40 4523.51 631.20 16.50
YGR033C 24 49.09 1.10 0.14 1.68 7.59 2876.15 657.95 1.30 0.97 229.08 1.51 5031.49 485.19 16.74
YDR200C 35 72.70 0.75 0.13 1.26 7.38 2659.38 701.96 1.44 1.00 326.01 0.74 4849.45 536.04 27.28
YHL010C 17 55.83 0.73 0.14 1.44 6.00 2920.87 674.88 1.38 1.41 225.81 1.12 4676.22 603.05 17.34
YCL005W 80 17.42 1.04 0.15 1.38 7.02 3003.23 861.34 1.21 1.23 425.52 1.07 5530.11 450.80 15.08
YLR214W 22 35.18 1.03 0.22 1.97 10.05 3080.21 708.54 1.43 0.69 174.54 1.68 4589.09 476.63 15.36

The first few columns are meta information such as gene ORF and batch id. The rest is the ionomics data.

Data pre-process

The raw data set should be pre-processed. The pre-processing function PreProcessing has functions:

The raw data are at first log transformed and then followed by the batch correction. User can chose not to perform batch correction, otherwise the user will use either median or median plus std method. If there is quality control for the batch correction, the user can use it and indicates in the argument of control_lines. Also one argument gives user option how to use these control line (control_use): If control_use is control, these control lines (data rows) are used for the batch correction factor; if control.out, lines except control lines are used.

This data set has a control line: YDL227C mutant. The code segment below is to identify it:

max(with(ion_data, table(Knockout)))
#> [1] 1617
which.max(with(ion_data, table(Knockout)))
#> YDL227C 
#>     209

The next stage is outlier detection. Here only univariate methods are implemented, including mad, IQR, and log.FC.dist. And like batch correction, user can skip this procedure by setting method_outliers = none in the function argument. There is a threshold to control the number of outliers. The larger the threshold (thres_outl) the more outlier removal.

Standardisation provides three methods: std, mad or custom. If the method is custom, user uses a specific std file like:

std <- read.table("../test-data/user_std.tsv", header = T, sep = "\t")
std
#>    Ion     sd
#> 1   Ca 0.1508
#> 2   Cd 0.0573
#> 3   Co 0.0580
#> 4   Cu 0.0735
#> 5   Fe 0.1639
#> 6    K 0.0940
#> 7   Mg 0.0597
#> 8   Mn 0.0771
#> 9   Mo 0.1142
#> 10  Na 0.1075
#> 11  Ni 0.0784
#> 12   P 0.0597
#> 13   S 0.0801
#> 14  Zn 0.0671

The pre-process procedure returns not only processed ionomics data but also a symbolic data set. This data set is based on the ionomics data and is determined by a threshold(thres_symb):

Note that the symbolic data set is important since the key part of network and enrichment analysis is based on the hierarchical clustering of symbolic data.

Let’s run the pre-process procedure:

pre <- PreProcessing(data = ion_data,
                     var_id = 1, batch_id = 2, data_id = 3,
                     method_norm = "median",
                     control_lines = "YDL227C",
                     control_use = "control",
                     method_outliers = "IQR",
                     thres_outl = 3,
                     stand_method = "std",
                     stdev = NULL,
                     thres_symb = 3)

names(pre)
#> [1] "stats.raw_data"    "stats.outliers"    "stats.batch_data" 
#> [4] "data.long"         "data.gene.logFC"   "data.gene.zscores"
#> [7] "data.gene.symb"    "plot.dot"          "plot.hist"

The results include summaries of raw data and processed data. The latter is:

pre$stats.batch_data %>% 
  kable(caption = 'Processed data summary', digits = 2, booktabs = T) %>%
  kable_styling(full_width = F, font_size = 10)
Table 2: Processed data summary
Ion Min. 1st Qu. Median Mean 3rd Qu. Max. Variance
Ca -4.45 -0.28 -0.13 -0.12 0.02 2.35 0.11
Cd -1.70 0.03 0.10 0.11 0.17 0.93 0.03
Co -2.80 0.02 0.09 0.06 0.15 1.60 0.05
Cu -0.66 -0.10 -0.03 -0.01 0.04 5.28 0.04
Fe -7.48 -0.17 -0.06 -0.02 0.07 6.88 0.14
K -2.21 -0.17 -0.01 -0.08 0.09 1.83 0.08
Mg -1.84 -0.06 0.01 -0.01 0.07 1.69 0.03
Mn -4.11 -0.24 -0.08 -0.13 0.01 1.78 0.06
Mo -2.03 -0.26 -0.08 -0.08 0.09 4.44 0.13
Na -7.41 -0.53 -0.22 -0.33 -0.04 1.25 0.24
Ni -2.40 -0.01 0.09 0.12 0.21 7.90 0.12
P -1.18 -0.06 0.00 -0.01 0.06 1.45 0.02
S -2.38 -0.03 0.05 0.06 0.16 2.38 0.04
Zn -0.46 -0.08 -0.03 -0.01 0.03 4.60 0.02

The pre-processed data and symbolic data are like like:

pre$data.gene.zscores %>% head() %>%
  kable(caption = 'Processed data', digits = 2, booktabs = T) %>% 
  kable_styling(full_width = F, font_size = 10,
                latex_options = c("striped", "scale_down"))
Table 3: Processed data
Line Ca Cd Co Cu Fe K Mg Mn Mo Na Ni P S Zn
YAL004W -1.16 0.75 1.19 -0.47 0.04 0.61 0.51 -0.84 -0.08 -1.84 1.71 0.52 0.33 -0.09
YAL005C -1.67 0.84 0.55 0.58 -2.79 0.59 0.31 -1.16 -1.42 -0.12 1.48 0.73 0.13 -0.13
YAL007C -2.12 0.64 0.23 -0.53 -0.24 0.79 -0.09 -0.14 1.22 -0.92 0.00 0.09 -0.29 -0.65
YAL008W -2.34 1.13 0.21 -0.73 -2.16 0.52 -0.02 -0.87 0.93 -0.58 0.02 -0.09 -0.73 -0.47
YAL009W -1.18 0.66 0.55 -1.11 -3.91 0.22 0.09 -0.18 1.50 -0.84 -0.09 0.14 0.01 -0.36
YAL010C -1.28 1.43 2.27 0.46 1.53 -2.75 0.04 -0.74 -9.71 -4.30 2.42 -0.98 -0.05 -0.01

pre$data.gene.symb %>% head() %>%
  kable(caption = 'Symbolic data', booktabs = T) %>%
  kable_styling(full_width = F, font_size = 10)
Table 3: Symbolic data
Line Ca Cd Co Cu Fe K Mg Mn Mo Na Ni P S Zn
YAL004W 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL005C 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL007C 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL008W 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL009W 0 0 0 0 -1 0 0 0 0 0 0 0 0 0
YAL010C 0 0 0 0 0 0 0 0 -1 -1 0 0 0 0

The symbolic data are calculated from the processed data with control of thres_symb (here is 3). You can obtain a new symbol data set by re-assigning a new threshold to the function symbol_data:

data_symb <- symbol_data(pre$data.gene.zscores, thres_symb = 2)
data_symb %>% head() %>%
  kable(caption = 'Symbolic data with threshold of 2', booktabs = T) %>%
  kable_styling(full_width = F, font_size = 10)
Table 4: Symbolic data with threshold of 2
Line Ca Cd Co Cu Fe K Mg Mn Mo Na Ni P S Zn
YAL004W 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL005C 0 0 0 0 -1 0 0 0 0 0 0 0 0 0
YAL007C -1 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL008W -1 0 0 0 -1 0 0 0 0 0 0 0 0 0
YAL009W 0 0 0 0 -1 0 0 0 0 0 0 0 0 0
YAL010C 0 0 1 0 0 -1 0 0 -1 -1 1 0 0 0

The thres_symb is a crucial value to get the symbolic data. Before re-setting this threshold, the user should check the summary of processed data and pay more attention to the maximum values. For example, some ions (for example, Cd and Mn) are all zero even with 2 of thres_symb.

The pre-processed data distribution is:

pre$plot.hist
Ionomcs data distribution plot

Figure 1: Ionomcs data distribution plot

Data filtering

There are a lot of ways to filter genes. Here genes are filtered according to symbolic data: remove genes with all values are zero.

data <- pre$data.gene.zscores
data_symb <- pre$data.gene.symb
idx <- rowSums(abs(data_symb[, -1])) > 0
dat <- data[idx, ]
dat_symb <- data_symb[idx, ]
dim(dat)
#> [1] 549  15

Data clustering

The hierarchical cluster analysis is the key part of gene network and gene enrichment analysis. The methodology is as follow:

One example is:

clust <- gene_clus(dat_symb[, -1], min_clust_size = 10)
names(clust)
#> [1] "clus"    "idx"     "tab"     "tab_sub"

The cluster centres are:

clust$tab_sub
#>   cluster nGenes
#> 1      11     72
#> 2       7     36
#> 3       1     27
#> 4      18     15
#> 5       5     12
#> 6       3     11
#> 7       8     11

It indicates that clusters and their number of genes (larger than min_cluster_size).

The identified gene located in those clusters are:

sum(clust$idx)                    #' numbers of all genes
#> [1] 184
as.character(dat[,1][clust$idx])  #' and they are
#>   [1] "YAL009W"   "YAL013W"   "YAL014C"   "YAL022C"   "YAL027W"   "YAL042W"  
#>   [7] "YAL046C"   "YAL051W"   "YAL066W"   "YAR003W"   "YAR031W"   "YAR035W"  
#>  [13] "YBR185C"   "YBR194W"   "YBR195C"   "YBR199W"   "YBR204C"   "YBR216C"  
#>  [19] "YBR217W"   "YBR218C"   "YBR219C"   "YBR221C"   "YBR233W"   "YBR240C"  
#>  [25] "YBR245C"   "YBR249C"   "YBR260C"   "YCL032W"   "YCL033C"   "YCL050C"  
#>  [31] "YCL060C"   "YCR005C"   "YDR063W"   "YDR095C"   "YDR096W"   "YDR109C"  
#>  [37] "YDR127W"   "YDR133C"   "YDR135C"   "YDR137W"   "YDR143C"   "YDR146C"  
#>  [43] "YDR147W"   "YDR158W"   "YDR181C"   "YDR183W"   "YDR186C"   "YDR198C"  
#>  [49] "YDR209C"   "YDR220C"   "YDR221W"   "YDR222W"   "YDR223W"   "YDR227W"  
#>  [55] "YDR338C"   "YDR344C"   "YDR370C"   "YDR374C"   "YDR379W"   "YDR415C"  
#>  [61] "YDR430C"   "YEL003W"   "YEL005C"   "YEL008W"   "YEL046C"   "YER004W"  
#>  [67] "YER007C-A" "YER007W"   "YER032W"   "YER034W"   "YER053C"   "YER054C"  
#>  [73] "YER056C-A" "YER067W"   "YER068C-A" "YER074W"   "YER086W"   "YGL205W"  
#>  [79] "YGL213C"   "YGL217C"   "YGL241W"   "YGL257C"   "YGR007W"   "YGR008C"  
#>  [85] "YGR034W"   "YGR043C"   "YGR045C"   "YGR070W"   "YGR071C"   "YGR077C"  
#>  [91] "YGR079W"   "YGR087C"   "YGR122W"   "YGR144W"   "YGR169C"   "YGR182C"  
#>  [97] "YGR187C"   "YGR193C"   "YGR194C"   "YGR203W"   "YGR205W"   "YGR207C"  
#> [103] "YHL006C"   "YHL019C"   "YHL021C"   "YHL028W"   "YHL032C"   "YHL037C"  
#> [109] "YHL041W"   "YHL046C"   "YHR012W"   "YHR046C"   "YHR057C"   "YHR073W"  
#> [115] "YHR075C"   "YHR123W"   "YHR143W"   "YHR150W"   "YHR152W"   "YHR155W"  
#> [121] "YJL148W"   "YJL162C"   "YJL198W"   "YJL199C"   "YJL206C-A" "YKL023W"  
#> [127] "YKL027W"   "YKL114C"   "YKL127W"   "YKL129C"   "YKL132C"   "YKL136W"  
#> [133] "YKL147C"   "YKL148C"   "YKL150W"   "YKL151C"   "YKL158W"   "YKL161C"  
#> [139] "YKL163W"   "YKL164C"   "YKL166C"   "YKL171W"   "YKL174C"   "YLL019C"  
#> [145] "YLL020C"   "YLL021W"   "YLL029W"   "YLL038C"   "YLR130C"   "YLR171W"  
#> [151] "YLR172C"   "YLR177W"   "YLR182W"   "YLR213C"   "YLR214W"   "YLR220W"  
#> [157] "YLR266C"   "YLR278C"   "YLR313C"   "YLR348C"   "YLR351C"   "YLR354C"  
#> [163] "YLR363C"   "YLR376C"   "YLR380W"   "YLR381W"   "YLR398C"   "YLR451W"  
#> [169] "YOL153C"   "YOR097C"   "YOR126C"   "YOR131C"   "YOR142W"   "YOR163W"  
#> [175] "YOR188W"   "YOR208W"   "YOR213C"   "YOR216C"   "YOR223W"   "YOR228C"  
#> [181] "YOR229W"   "YOR245C"   "YOR247W"   "YOR255W"

Gene network

The gene network uses both the ionomics and symbolic data. The similarity measures on ionomics data are used for constructing network. Before creating network, these measurement are further filtered by:

The methods implemented are: pearson, spearman, kendall, cosine, mahal_cosine or hybrid_mahal_cosine. The first three methods are correleation methods and cosine is similar to correlation. For the last two methods, see publication: Extraction and Integration of Genetic Networks from Short-Profile Omic Data Sets for details.

For example, we use the Pearson correlation as similarity measure for network analysis:

net <- GeneNetwork(data = dat,
                   data_symb = dat_symb,
                   min_clust_size = 10,
                   thres_corr = 0.75,
                   method_corr = "pearson")

The network with nodes coloured by the symbolic data clustering is:

net$plot.pnet1
Network with Pearson correlation: symbolic clustering

Figure 2: Network with Pearson correlation: symbolic clustering

The same network, but nodes are coloured by the network community detection:

net$plot.pnet2
Network with Pearson correlation: community detction

Figure 3: Network with Pearson correlation: community detction

The network analysis also returns a network impact and betweenness plot:

net$plot.impact_betweenness
Network with Pearson correlation: impact and betweeness

Figure 4: Network with Pearson correlation: impact and betweeness

For the comparison purpose, we use different similarity methods. Here we choose Cosine:

net_1 <- GeneNetwork(data = dat,
                     data_symb = dat_symb,
                     min_clust_size = 10,
                     thres_corr = 0.75,
                     method_corr = "cosine")
net_1$plot.pnet1
Network analysis based on Cosine

Figure 5: Network analysis based on Cosine

net_1$plot.pnet2
Network analysis based on Cosine

Figure 6: Network analysis based on Cosine

Use Hybrid Mahalanobis Cosine:

net_2 <- GeneNetwork(data = dat,
                     data_symb = dat_symb,
                     min_clust_size = 10,
                     thres_corr = 0.75,
                     method_corr = "mahal_cosine")
net_2$plot.pnet1
Network with Mahalanobis Cosine

Figure 7: Network with Mahalanobis Cosine

net_2$plot.pnet2
Network with Mahalanobis Cosine

Figure 8: Network with Mahalanobis Cosine

Again, we use Hybrid Mahalanobis Cosine:

net_3 <- GeneNetwork(data = dat,
                     data_symb = dat_symb,
                     min_clust_size = 10,
                     thres_corr = 0.75,
                     method_corr = "hybrid_mahal_cosine")
net_3$plot.pnet1
Network with Hybrid Mahalanobis Cosine

Figure 9: Network with Hybrid Mahalanobis Cosine

net_3$plot.pnet2
Network with Hybrid Mahalanobis Cosine

Figure 10: Network with Hybrid Mahalanobis Cosine

Enrichment analysis

The enrichment analysis is based on symbolic data clustering. The genes in clusters are considered target gene sets while genes in the whole data set is the universe gene set. The Bioconductor R package GOstats is used for the enrichment analysis.

The KEGG enrichment analysis, using a p-values of 0.05 and Genome wide annotation for Yeast, org.Sc.sgd.db:

kegg <- kegg_enrich(data = dat_symb, min_clust_size = 10, pval = 0.05,
                    annot_pkg =  "org.Sc.sgd.db")

#' kegg
kegg %>% 
  kable(caption = 'KEGG enrichmenat analysis',
        digits = 3, booktabs = T) %>%
  kable_styling(full_width = F, font_size = 10,
                latex_options = c("striped", "scale_down"))
Table 5: KEGG enrichmenat analysis
Cluster KEGGID Pvalue Count Size Term
Cluster 18 (15 genes) 00290 0.009 2 2 Valine, leucine and isoleucine biosynthesis
Cluster 18 (15 genes) 00520 0.009 2 2 Amino sugar and nucleotide sugar metabolism
Cluster 18 (15 genes) 00260 0.012 3 6 Glycine, serine and threonine metabolism
Cluster 18 (15 genes) 00010 0.024 2 3 Glycolysis / Gluconeogenesis
Cluster 18 (15 genes) 01110 0.037 5 22 Biosynthesis of secondary metabolites
Cluster 3 (11 genes) 00400 0.009 2 2 Phenylalanine, tyrosine and tryptophan biosynthesis
Cluster 8 (11 genes) 01100 0.006 6 55 Metabolic pathways
Cluster 8 (11 genes) 00564 0.027 2 6 Glycerophospholipid metabolism

Note that there could be none results for KEGG enrichment analysis. Change arguments such as min_clust_size as appropriate.

The GO Terms enrichment analysis with ontology of BP (other two are MF and CC):

go <- go_enrich(data = dat_symb, min_clust_size = 10, pval = 0.05,
                ont = "BP", annot_pkg =  "org.Sc.sgd.db")
#' go
go %>% head() %>% 
  kable(caption = 'GO Terms enrichmenat analysis',
        digits = 3, booktabs = T) %>%
  kable_styling(full_width = F, font_size = 10,
                latex_options = c("striped", "scale_down"))
Table 6: GO Terms enrichmenat analysis
Cluster ID Description Pvalue Count CountUniverse Ontology
Cluster 11 (72 genes) GO:0051336 regulation of hydrolase activity 0.0018 4 12 BP
Cluster 11 (72 genes) GO:0043085 positive regulation of catalytic activity 0.0044 4 15 BP
Cluster 11 (72 genes) GO:0035303 regulation of dephosphorylation 0.0068 2 3 BP
Cluster 11 (72 genes) GO:0046889 positive regulation of lipid biosynthetic process 0.0068 2 3 BP
Cluster 11 (72 genes) GO:1903727 positive regulation of phospholipid metabolic process 0.0068 2 3 BP
Cluster 11 (72 genes) GO:0044764 multi-organism cellular process 0.0074 3 9 BP

Exploratory analysis

The explanatory analysis performs PCA and correlation analysis for ions in terms of genes. Note that this analysis treats ions as samples/replicates while genes variables/features. The explanatory analysis is normally employed in the different stages of analysis.

For example, we apply it to the pre-processed data dat before any other analysis:

expl <- ExploratoryAnalysis(data = dat)
names(expl)
#> [1] "plot.pca"       "data.pca.load"  "plot.corr"      "plot.corr.heat"
#> [5] "plot.heat"      "plot.net"

The PCA plot is:

expl$plot.pca
Ion PCA plot on pre-processed data

Figure 11: Ion PCA plot on pre-processed data

The Person correlation of ions are shown in correlation plot, heatmap and network plot:

expl$plot.corr
Ion correlation plots on pre-processed data

Figure 12: Ion correlation plots on pre-processed data

expl$plot.corr.heat
Ion correlation plots on pre-processed data

Figure 13: Ion correlation plots on pre-processed data

expl$plot.net
Ion correlation plots on pre-processed data

Figure 14: Ion correlation plots on pre-processed data

The correlation between ions and genes are shown in heatmap with dendrogram:

expl$plot.heat
Correlation between ions and genes on pre-processed data

Figure 15: Correlation between ions and genes on pre-processed data

The explanatory analysis can be used after gene identification. Here for example after gene clustering analysis:

#' update data set with results of gene clustering
dat_clus <- dat[clust$idx, ]
dim(dat_clus)
#> [1] 184  15

expl.1 <- ExploratoryAnalysis(data = dat_clus)
Explanotory analysis after gene clustering

Figure 16: Explanotory analysis after gene clustering

Explanotory analysis after gene clustering

Figure 17: Explanotory analysis after gene clustering

Explanotory analysis after gene clustering

Figure 18: Explanotory analysis after gene clustering

expl.1$plot.pca
Explanotory analysis after gene clustering

Figure 19: Explanotory analysis after gene clustering

expl.1$plot.net
Explanotory analysis after gene clustering

Figure 20: Explanotory analysis after gene clustering